Winter 2018
After the colons (in the same line) please write just your first name, last name, and the 9 digit student ID number below.
First Name: Zhihao
Last Name: Meng
Student ID: 915537642
We use a script that extracts your answers by looking for cells in between the cells containing the exercise statements. So you
To make markdown, please switch the cell type to markdown (from code) - you can hit 'm' when you are in command mode - and use the markdown language. For a brief tutorial see: https://daringfireball.net/projects/markdown/syntax
In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:
| Table | Description |
|---|---|
| crime | Crime reports dating back to 2010. |
| mobile_food_locations | List of all locations where mobile food vendors sell. |
| mobile_food_permits | List of all mobile food vendor permits. More details here. |
| mobile_food_schedule | Schedules for mobile food vendors. |
| noise | Noise complaints dating back to August 2015. |
| parking | List of all parking lots. |
| parks | List of all parks. |
| schools | List of all schools. |
| zillow | Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
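Converting a latitude/longitude pair to a postal code amounts to a point-in-polygon test against the shapefile geometries. A minimal sketch with shapely, using a made-up square polygon in place of a real ZCTA boundary (the coordinates below are illustrative, not taken from the shapefile):

```python
from shapely.geometry import Point, Polygon

# A toy square standing in for one postal-code polygon; real polygons come
# from the geometry column of the shapefile loaded with geopandas.
zcta = Polygon([(-122.42, 37.76), (-122.38, 37.76),
                (-122.38, 37.80), (-122.42, 37.80)])

point = Point(-122.40, 37.78)  # a (longitude, latitude) pair from a table
print(zcta.contains(point))    # True
```

Looping this test over all ZCTA polygons (or using a geopandas spatial join) assigns each record a postal code.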
Exercise 1.1. (20 pts) Which mobile food vendor(s) sells at the most locations?
import sqlite3 as sql
import pandas as pd
db = sql.connect('/ucdavis/lecture/STA141B/141b-hw5--Zhihao2018/sf_data.sqlite')
food_loc = pd.read_sql("SELECT * FROM mobile_food_locations",db).dropna()
food_per = pd.read_sql("SELECT * FROM mobile_food_permits",db)
food_sch = pd.read_sql("SELECT * FROM mobile_food_schedule",db)
#One row per (location, permit) pair; grouping by locationid alone would keep
#an arbitrary permit whenever several vendors share a location.
db.execute("CREATE TABLE locidper AS SELECT locationid, permit, COUNT(*) AS count FROM mobile_food_schedule GROUP BY locationid, permit")
#db.execute("DROP TABLE locidper")
#Join the tables locidper and mobile_food_permits on the column permit.
db.execute("CREATE TABLE mobile_food_locidper AS SELECT L.locationid, R.* FROM locidper AS L INNER JOIN mobile_food_permits AS R ON L.permit = R.permit")
#db.execute("DROP TABLE mobile_food_locidper")
#Keep only the trucks whose permit status is APPROVED.
db.execute("CREATE TABLE approve AS SELECT * FROM mobile_food_locidper WHERE Status = 'APPROVED'")
#db.execute("DROP TABLE approve")
#Count the locations for each approved permit.
db.execute("CREATE TABLE loc_count AS SELECT locationid,permit,Applicant,COUNT(locationid) AS num FROM approve GROUP BY permit")
#db.execute("DROP TABLE loc_count")
#In SQLite, the bare columns in a MAX() query come from the row where the maximum occurs.
pd.read_sql("SELECT permit,Applicant,MAX(num) FROM loc_count",db)
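The chain of temporary tables above can also be collapsed into a single query. A sketch on a toy in-memory database, assuming the same schema (the table contents here are invented for illustration):

```python
import sqlite3

# Toy in-memory database mimicking the schema used in this notebook.
db_toy = sqlite3.connect(":memory:")
db_toy.executescript("""
CREATE TABLE mobile_food_schedule (locationid INTEGER, permit TEXT);
CREATE TABLE mobile_food_permits (permit TEXT, Applicant TEXT, Status TEXT);
INSERT INTO mobile_food_schedule VALUES
    (1, 'A'), (2, 'A'), (3, 'A'), (1, 'B'), (4, 'B'), (5, 'C');
INSERT INTO mobile_food_permits VALUES
    ('A', 'Taco Truck', 'APPROVED'),
    ('B', 'Coffee Cart', 'APPROVED'),
    ('C', 'Hot Dogs', 'EXPIRED');
""")

# One query: count distinct locations per approved permit, keep the largest.
row = db_toy.execute("""
    SELECT P.permit, P.Applicant, COUNT(DISTINCT S.locationid) AS num
    FROM mobile_food_schedule AS S
    JOIN mobile_food_permits AS P ON S.permit = P.permit
    WHERE P.Status = 'APPROVED'
    GROUP BY P.permit
    ORDER BY num DESC
    LIMIT 1
""").fetchone()
print(row)  # ('A', 'Taco Truck', 3)
```

COUNT(DISTINCT ...) guards against a schedule listing the same location twice for one permit.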
Exercise 1.2. (40 pts) Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but these are examples of reasonable questions:
Please make sure to clearly state each of your questions in your submission.
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd
import osmnx as ox
import shapely.geometry as geom
postcode = gpd.read_file("cb_2016_us_zcta510_500k/cb_2016_us_zcta510_500k.shp")
sfo = gpd.read_file("SF Find Neighborhoods/geo_export_b90f6349-9fa6-4740-ae5f-dba458ee365b.shp")
postcode.head()
sfo.head()
Question 1
zillow = pd.read_sql("SELECT * FROM zillow",db)
zillow.head()
price = pd.read_sql("SELECT AVG(MedianSoldPricePerSqft_AllHomes) AS avg, RegionName FROM zillow GROUP BY RegionName",db)
price[price['avg']==max(price['avg'])]
price[price['avg']==min(price['avg'])]
highest = postcode[postcode['ZCTA5CE10'].astype(int) == 94104]
lowest = postcode[postcode['ZCTA5CE10'].astype(int) == 94124]
sfo_streets = ox.graph_from_place("San Francisco", network_type = "drive")
nodes, edges = ox.graph_to_gdfs(sfo_streets)
ax = sfo.plot(color = "grey")
highest.plot(ax = ax)
plt.show()
ax = sfo.plot(color = "grey")
lowest.plot(ax = ax)
plt.show()
The TABLE zillow contains the house price information for San Francisco. I use the column MedianSoldPricePerSqft_AllHomes as the criterion for identifying the most and least expensive areas in San Francisco, and the postcode shapefile provides the boundary coordinates (latitude and longitude) for each zip code. First, I find the zip codes of the most and least expensive areas in San Francisco using the TABLE zillow and the functions max and min. Then I match those zip codes against the postcode dataframe to obtain the coordinates of the two areas. Finally, I plot them over the sfo dataframe, which contains the boundary coordinates of every neighborhood in San Francisco. In other words, I plot the most and least expensive areas on a map of San Francisco.

From the two plots above, we can see that the most expensive area is in the northeast of San Francisco, and the least expensive area is in the southeast.
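The max/min lookups above can also be written with idxmax/idxmin, which return the row labels of the extremes directly. A sketch on toy data shaped like the price dataframe (the numbers are invented):

```python
import pandas as pd

# Toy per-region averages, same columns as the `price` dataframe above.
price_toy = pd.DataFrame({"avg": [900.0, 400.0, 650.0],
                          "RegionName": [94104, 94124, 94110]})

# idxmax/idxmin pick out the same rows as the boolean masks used above.
print(price_toy.loc[price_toy["avg"].idxmax(), "RegionName"])  # 94104
print(price_toy.loc[price_toy["avg"].idxmin(), "RegionName"])  # 94124
```

This avoids scanning the column twice with `max(...)`/`min(...)` and a boolean comparison.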
Question 2
crime = pd.read_sql("SELECT * FROM crime",db)
crimetime = crime['Datetime']
crime.head()
#Drop the date from each entry of the column Datetime, keeping only the time of day.
crime["Time"] = crime["Datetime"].str.split(" ", n=1).str[1]
#Write the dataframe with the new column Time back to the database as the TABLE crimetime.
crime.to_sql('crimetime', db, if_exists='replace')
#db.execute("DROP TABLE crimetime")
#Drop the duplicated cases
db.execute("CREATE TABLE exactcrime AS SELECT IncidntNum,DayOfWeek,Time,PdDistrict,Lon,Lat FROM crimetime GROUP BY IncidntNum")
#db.execute("DROP TABLE exactcrime")
#Count the crimes by PdDistrict, DayOfWeek and Time.
db.execute("CREATE TABLE Districtnum AS SELECT *,COUNT(PdDistrict) AS count FROM exactcrime GROUP BY PdDistrict,DayOfWeek,Time")
#db.execute("DROP TABLE Districtnum")
pd.read_sql("SELECT *,MAX(count) AS max FROM Districtnum",db)
District = pd.read_sql("SELECT * FROM exactcrime WHERE DayOfWeek='Saturday' AND Time='23:00:00'",db)
District.head()
lonlat = [geom.Point(x) for x in zip(District.Lon, District.Lat)]
District = gpd.GeoDataFrame(District, geometry = lonlat)
ax = edges.plot(color = "gray", linewidth = 0.6, figsize = (16, 16))
District.plot(ax = ax)
plt.show()
District1=District[District['PdDistrict']=="SOUTHERN"]
ax = edges.plot(color = "gray", linewidth = 0.6, figsize = (16, 16))
District1.plot(ax = ax)
ax.set_xlim([min(District1.Lon), max(District1.Lon)])
ax.set_ylim([min(District1.Lat), max(District1.Lat)])
plt.show()
The TABLE crime contains crime reports with a date and time for each incident. First, I modify the column Datetime: I drop the dates and keep only the time of day for each crime report, because the dates are not needed for this analysis. Second, there are duplicated crime reports in the dataframe, so I keep one row per incident using the column IncidntNum. Third, I count the crimes grouped by PdDistrict, DayOfWeek and Time, which gives the number of crimes for every district, weekday and time of day. The maximum occurs in the SOUTHERN district at 23:00 on Saturday, with 706 crimes in total over these years.

Therefore, the most dangerous place is the SOUTHERN district and the most dangerous time is 23:00 on Saturday. The first plot above shows the location of every crime that occurred at 23:00 on Saturday; most of them occurred in the northeast of San Francisco. The second plot shows the crimes that occurred in the SOUTHERN district at 23:00 on Saturday.
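The time-extraction and deduplication steps described above can be done entirely in pandas as well. A sketch on toy rows (the incident numbers and timestamps are invented):

```python
import pandas as pd

# Toy crime reports with one duplicated incident number.
crime_toy = pd.DataFrame({
    "IncidntNum": [1, 1, 2],
    "Datetime": ["2016-01-02 23:00:00",
                 "2016-01-02 23:00:00",
                 "2016-01-03 12:00:00"],
})

# Keep only the time of day, then one row per incident number.
crime_toy["Time"] = crime_toy["Datetime"].str.split(" ", n=1).str[1]
dedup = crime_toy.drop_duplicates(subset="IncidntNum")
print(len(dedup))  # 2
```

drop_duplicates with subset="IncidntNum" mirrors the GROUP BY IncidntNum used in the SQL above.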
Question 3
noise = pd.read_sql("SELECT * FROM noise",db).dropna()
noise.head()
lonlat1 = [geom.Point(x) for x in zip(noise.Lon, noise.Lat)]
noise = gpd.GeoDataFrame(noise, geometry = lonlat1)
ax = edges.plot(color = "gray", linewidth = 0.6, figsize = (16, 16))
noise.plot(ax = ax,color = "green")
plt.show()
food_loc.head()
lonlat2 = [geom.Point(x) for x in zip(food_loc.Longitude, food_loc.Latitude)]
food_loc = gpd.GeoDataFrame(food_loc, geometry = lonlat2)
ax = edges.plot(color = "gray", linewidth = 0.6, figsize = (16, 16))
food_loc.plot(ax = ax,color = "purple")
ax.set_xlim([min(noise.Lon), max(noise.Lon)])
ax.set_ylim([min(noise.Lat), max(noise.Lat)])
plt.show()
ax = edges.plot(color = "gray", linewidth = 0.6, figsize = (16, 16))
noise.plot(ax = ax,color = "green")
food_loc.plot(ax = ax,color = "purple")
ax.set_xlim([min(noise.Lon), max(noise.Lon)])
ax.set_ylim([min(noise.Lat), max(noise.Lat)])
plt.show()
The TABLE noise contains the noise complaints, which I name noise, and the TABLE mobile_food_locations contains the locations of the food trucks, which I name food_loc. I convert the dataframes noise and food_loc to geodataframes and plot them. The first plot above shows the noise complaints: they are spread across the city, but their density is highest in the northeast. The second plot shows the food truck locations: they are mainly in the east of the city, especially the northeastern part. From the third plot, we can see little relationship between noise complaints and food truck locations, because the area with the highest density of complaints is not the same as the area with the highest density of trucks. Food trucks may be one source of noise, but they are not the main one, since complaints come from nearly the whole city while most food trucks are in the eastern part.
Question 4
food_sch.head()
#Count the trucks by the columns DayOfWeek, EndHour and StartHour.
db.execute("CREATE TABLE Timenum AS SELECT DayOfWeek,EndHour,StartHour,COUNT(DayOfWeek) AS count FROM mobile_food_schedule GROUP BY DayOfWeek, EndHour, StartHour")
#db.execute("DROP TABLE Timenum")
pd.read_sql("SELECT * ,MAX(count) FROM Timenum",db)
In the last question, I plotted the locations of the food trucks. From that plot, the best place to find food trucks is the northeastern part of the city; concretely, roughly the area with latitude between 37.76 and 37.80 and longitude between -122.44 and -122.38.

The TABLE mobile_food_schedule contains the schedules of the food trucks, which I name food_sch. I count the rows grouped by DayOfWeek, EndHour and StartHour. The time slot with the largest number of food trucks is 9:00am to 10:00am on Monday, with 105 trucks scheduled in total.

Therefore, the best time to find food trucks is 9:00am to 10:00am on Monday, and the best place is the area with latitude between 37.76 and 37.80 and longitude between -122.44 and -122.38, in the northeastern part of the city.
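The grouped count above can equally be done on the food_sch dataframe in pandas. A sketch with toy schedule rows (column names taken from the queries above, values invented):

```python
import pandas as pd

# Toy schedule rows: three Monday slots, one Tuesday slot.
sch = pd.DataFrame({
    "DayOfWeek": ["Mo", "Mo", "Mo", "Tu"],
    "StartHour": [9, 9, 10, 9],
    "EndHour":   [10, 10, 11, 10],
})

# Count rows per (day, start, end) slot and pick the busiest one.
counts = sch.groupby(["DayOfWeek", "StartHour", "EndHour"]).size()
busiest = counts.idxmax()   # the (day, start, end) tuple with the most trucks
print(counts.max())         # 2
```

groupby(...).size() plays the role of the SQL COUNT(DayOfWeek) ... GROUP BY used above.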
Question 5
parking = pd.read_sql("SELECT * FROM parking",db)
lonlat3 = [geom.Point(x) for x in zip(parking.Lon, parking.Lat)]
parking = gpd.GeoDataFrame(parking, geometry = lonlat3)
parking.head()
ax = edges.plot(color = "gray", linewidth = 0.6, figsize = (16, 16))
parking.plot(ax = ax,color = "red")
highest.plot(ax = ax)
ax.set_xlim([min(noise.Lon), max(noise.Lon)])
ax.set_ylim([min(noise.Lat), max(noise.Lat)])
plt.show()
schools = pd.read_sql("SELECT * FROM schools",db)
lonlat4 = [geom.Point(x) for x in zip(schools.Lon, schools.Lat)]
schools = gpd.GeoDataFrame(schools, geometry = lonlat4)
schools.head()
ax = edges.plot(color = "gray", linewidth = 0.6, figsize = (16, 16))
schools.plot(ax = ax,color = "yellow")
highest.plot(ax = ax)
ax.set_xlim([min(noise.Lon), max(noise.Lon)])
ax.set_ylim([min(noise.Lat), max(noise.Lat)])
plt.show()
I want to find out whether there are relationships between house prices and parking lots, and between house prices and schools. The TABLE parking contains the parking lots, which I name parking, and the TABLE schools contains the schools, which I name schools. The first plot above shows the distribution of parking lots together with the area with the highest house prices. The most expensive area sits exactly where the density of parking lots is highest, so there appears to be a clear association between house prices and parking lots: areas with more parking lots tend to have higher house prices.

The second plot shows the distribution of schools together with the most expensive area. The schools are spread almost uniformly across the city, so there is little relationship between house prices and the distribution of schools.
db.close()